Audio visual speech source separation via improved context dependent association model
Authors
Abstract
In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on a mean square error (MSE) measure between the estimated and target visual parameters. This function is minimized to estimate the de-mixing vector/filters that separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We have also proposed a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared to the existing GMM-based model, and the proposed AVSS algorithm improves speech separation quality compared to reference ICA- and AVSS-based methods.
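As a rough illustration of the hybrid criterion described above, the sketch below (Python/NumPy, not the authors' implementation) optimizes a de-mixing vector for a toy instantaneous two-source mixture by minimizing the MSE between lip parameters predicted from the candidate separated signal and the observed lip parameters, minus a kurtosis term as the non-Gaussianity measure. The frame-energy feature, the identity "associator", and the synthetic lip track are placeholders for the paper's trained neural associator and real visual features.

```python
# Minimal sketch of a hybrid AV-MSE + kurtosis separation criterion (assumptions,
# not the paper's code): all features and the "associator" below are stand-ins.
import numpy as np
from scipy.optimize import minimize

def frame_log_energy(s, frame_len=160):
    """Toy acoustic feature: mean-removed per-frame log energy (stand-in for spectra)."""
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    e = np.log(np.sum(frames ** 2, axis=1) + 1e-8)
    return e - e.mean()

def kurtosis(s):
    s = (s - s.mean()) / (s.std() + 1e-12)
    return np.mean(s ** 4) - 3.0

def hybrid_cost(w, X, lip_params, associator, lam=0.1):
    """J(w) = MSE(associator(features(w^T X)), lip) - lam * |kurtosis(w^T X)|."""
    w = w / (np.linalg.norm(w) + 1e-12)   # keep the de-mixing vector on the unit sphere
    y = w @ X                             # candidate separated source
    v_hat = associator(frame_log_energy(y))
    mse = np.mean((v_hat - lip_params) ** 2)
    return mse - lam * abs(kurtosis(y))

# Toy instantaneous mixture of a "speech-like" (super-Gaussian) source and noise.
rng = np.random.default_rng(0)
n = 16000
target = np.sin(2 * np.pi * 5 * np.linspace(0, 1, n)) * rng.laplace(size=n)
interf = rng.normal(size=n)
X = np.array([[1.0, 0.6], [0.5, 1.0]]) @ np.vstack([target, interf])

# Hypothetical lip parameters: a noisy function of the target's energy envelope,
# standing in for lip width/height tracked from the video.
feat_target = frame_log_energy(target)
lip = feat_target + 0.05 * rng.normal(size=feat_target.shape)

associator = lambda feats: feats          # identity stand-in for the trained neural associator

res = minimize(hybrid_cost, x0=rng.normal(size=2), args=(X, lip, associator),
               method="Nelder-Mead")
w_hat = res.x / np.linalg.norm(res.x)
y_hat = w_hat @ X                         # estimate of the visually relevant source
```

In the paper the associator is a neural network trained offline on paired acoustic and lip data, and the same criterion is extended from a de-mixing vector to de-mixing filters for convolutive mixtures; the sketch only covers the instantaneous case.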
Similar resources
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Robust audio-visual speech synchrony detection by generalized bimodal linear prediction
We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed ti...
Using the Bi-modality of Speech for Convolutive Frequency Domain Blind Speech Separation
The problem of blind source separation for the case of convolutive mixtures of speech is considered. A novel algorithm is proposed that exploits the bi-modality of speech. This is achieved by incorporating joint audio-visual features into an existing BSS algorithm for the purpose of improving the convergence rate of the source separation algorithm. The increase in the rate of convergence when u...
Extracting an AV speech source f
We present a new approach to the source separation problem for multiple speech signals. Using the extra visual information of the speaker’s face, the method aims to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker’s lip movements. We define a statistical model of the joint probability of visual and spectral audio input for quantifying the ...
Speech extraction based on ICA and audio-visual coherence
We present a new approach to the source separation problem for multiple speech signals. Using the extra visual information of the speaker’s face, the method aims to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker’s lip movements. We define a statistical model of the joint probability of visual and spectral audio input for quantifying th...
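The joint audio-visual probability model mentioned in the two preceding abstracts can be illustrated, very loosely, by fitting a single joint Gaussian over paired audio and visual feature frames and using its log-likelihood as a coherence score; the cited works use a richer statistical model, and the function names below are placeholders.

```python
# Minimal sketch of scoring audio-visual coherence with a joint statistical model
# (a single Gaussian stands in for the trained joint AV probability model).
import numpy as np
from scipy.stats import multivariate_normal

def fit_joint_av_model(audio_feats, visual_feats):
    """Fit a joint Gaussian over concatenated [audio, visual] feature frames."""
    Z = np.hstack([audio_feats, visual_feats])        # (n_frames, d_audio + d_visual)
    return multivariate_normal(mean=Z.mean(axis=0),
                               cov=np.cov(Z, rowvar=False),
                               allow_singular=True)

def av_coherence(model, audio_feats, visual_feats):
    """Average joint log-likelihood; higher means more coherent with the lip stream."""
    Z = np.hstack([audio_feats, visual_feats])
    return float(model.logpdf(Z).mean())

# Usage sketch: among candidate separated signals, keep the one whose spectral
# features score highest against the observed lip parameters, e.g.
# best = max(candidates, key=lambda a: av_coherence(model, spectra(a), lip_params))
```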
Journal title: EURASIP J. Adv. Sig. Proc.
Volume: 2014, Issue: -
Pages: -
Publication date: 2014